May 21 2025
In recent years, sound classification has become a valuable tool in real-time monitoring systems, particularly for public safety and smart surveillance. One powerful model for sound classification is YAMNet, developed by Google. In this blog post, we will walk you through how we used YAMNet along with a custom-trained dataset to build an emergency sound detection system capable of identifying critical sounds like gunshots, glass breaks, and animal attacks.
YAMNet (Yet Another Mobile Network) is a pre-trained deep learning model from Google that classifies audio into 521 sound event classes drawn from the AudioSet ontology. It uses the MobileNet v1 architecture, which is lightweight yet efficient, making it well suited to real-time audio classification tasks.
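To get a feel for what YAMNet produces, the model can be loaded straight from TensorFlow Hub and run on a raw waveform; it returns per-frame class scores over the AudioSet ontology along with the embeddings we rely on later. A minimal sketch (the one second of silence is just a placeholder input):

import numpy as np
import tensorflow_hub as hub

# Load YAMNet from TensorFlow Hub (downloaded on first use)
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')

# Placeholder input: one second of silence at 16 kHz
waveform = np.zeros(16000, dtype=np.float32)

# scores: (frames, 521) class scores; embeddings: (frames, 1024)
scores, embeddings, spectrogram = yamnet(waveform)
print(scores.shape, embeddings.shape)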
We built an emergency sound detection system that works in real time and can identify high-risk sounds such as:
Gunshots
Glass breaking
Dog barks and howls (animal attacks)
Here’s the basic workflow we followed:
Dataset Preparation: We collected a varied dataset of emergency sounds (gunshots, glass breaking, howls, etc.) from multiple sources.
Feature Extraction using YAMNet: We used YAMNet to extract embeddings (1024-dimensional feature vectors) from each audio file in the dataset. These embeddings represent the key characteristics of each sound.
Training a Custom Classifier: We trained a machine learning model (such as Random Forest or MLP) using the extracted embeddings to classify emergency sounds into specific categories.
Real-Time Detection: We implemented a live microphone input system that listens continuously, processes short audio chunks with YAMNet, classifies them with the trained model, and raises an alert when a critical sound is detected.
Implementation Overview
After designing the workflow, we implemented the project using Python and TensorFlow. The implementation was divided into three main parts:
1. Audio Feature Extraction using YAMNet
We used the pre-trained YAMNet model to extract embeddings from our emergency audio dataset. These embeddings are high-level features that represent each sound in a compact form.
# Example: Extract embeddings from an audio file
import numpy as np
import soundfile as sf
import params as yamnet_params  # from the tensorflow/models yamnet repo
import yamnet as yamnet_model

# Load the YAMNet model and its pre-trained weights
yamnet = yamnet_model.yamnet_frames_model(yamnet_params.Params())
yamnet.load_weights('yamnet.h5')

# Load the audio as mono float32
waveform, sr = sf.read('sample_gunshot.wav', dtype='float32')
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)

# YAMNet expects 16 kHz audio; resample with linear interpolation if needed
if sr != 16000:
    duration = len(waveform) / sr
    waveform = np.interp(np.linspace(0.0, duration, int(duration * 16000)),
                         np.linspace(0.0, duration, len(waveform)),
                         waveform).astype(np.float32)

# Extract embeddings: one 1024-dim vector per 0.96 s window (0.48 s hop)
scores, embeddings, spectrogram = yamnet(waveform)
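To turn a folder of labeled clips into training data, we looped over the files and mean-pooled each clip's per-frame embeddings into a single vector. A sketch of that step, assuming a hypothetical dataset/<label>/<file>.wav layout and a load_waveform() helper that wraps the loading and resampling code above:

from pathlib import Path

X, y = [], []
for wav_path in Path('dataset').glob('*/*.wav'):
    waveform = load_waveform(wav_path)         # mono 16 kHz float32, as above
    _, embeddings, _ = yamnet(waveform)
    X.append(embeddings.numpy().mean(axis=0))  # mean-pool frames -> one 1024-dim vector
    y.append(wav_path.parent.name)             # the folder name is the label
X, y = np.array(X), np.array(y)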
2. Training the Custom Classifier
Once we had the embeddings, we trained a custom classifier using scikit-learn. This allowed us to fine-tune detection for only a few important classes, such as gunshot, glass_break, and dog_bark.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load your labeled embeddings
X = ...  # YAMNet embeddings (one mean-pooled 1024-dim vector per clip)
y = ...  # Corresponding labels ('gunshot', 'glass_break', 'dog_bark', ...)

# Train/test split, stratified so every class appears in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Train classifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
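Before deploying, it is worth checking the classifier on the held-out split and saving it for the real-time script. A minimal sketch (the emergency_clf.joblib filename is an arbitrary choice):

from sklearn.metrics import accuracy_score, classification_report
import joblib

# Evaluate on the held-out 20% split
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Persist the trained classifier for the real-time detection script
joblib.dump(clf, 'emergency_clf.joblib')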
3. Real-Time Detection System
For real-time detection, we created a script that listens to the microphone, processes short audio chunks with YAMNet, and classifies them with the trained model. When it detects a critical sound, it raises an alert.
import sounddevice as sd

SAMPLE_RATE = 16000  # YAMNet expects 16 kHz mono audio

def callback(indata, frames, time, status):
    # Preprocess the live audio input (mono float32 chunk in indata[:, 0])
    # Extract YAMNet embeddings
    # Predict the class with the trained classifier
    # Raise an alert if a critical sound is detected
    pass

# Start the microphone stream and keep it open while the callback runs
with sd.InputStream(channels=1, samplerate=SAMPLE_RATE, callback=callback):
    sd.sleep(60_000)  # listen for one minute
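One practical caveat: sounddevice runs the callback on a latency-sensitive audio thread, so doing TensorFlow inference inside it risks dropped buffers. A common pattern, sketched here under that assumption with the yamnet model and clf classifier from the previous sections, is to have the callback only enqueue audio and run inference in the main loop:

import queue
import numpy as np
import sounddevice as sd

audio_q = queue.Queue()

def callback(indata, frames, time, status):
    # No heavy work here; just hand the chunk to the main loop
    audio_q.put(indata[:, 0].copy())

# blocksize=16000 delivers one-second chunks at 16 kHz
with sd.InputStream(channels=1, samplerate=16000,
                    blocksize=16000, callback=callback):
    while True:
        chunk = audio_q.get()
        _, embeddings, _ = yamnet(np.float32(chunk))
        # Mean-pool the frame embeddings and classify the chunk
        pred = clf.predict(embeddings.numpy().mean(axis=0, keepdims=True))[0]
        if pred in {'gunshot', 'glass_break', 'dog_bark'}:
            print(f'ALERT: {pred} detected!')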
This system can be used in a variety of safety-focused environments:
Smart home and business security systems
Healthcare and elderly-care monitoring
Public safety and smart-city surveillance
By combining YAMNet’s powerful feature extraction with a custom-trained classifier tailored for emergency audio events, we’ve built a scalable and accurate detection system. This solution can be integrated into smart security systems, healthcare monitoring tools, and public safety infrastructure to improve response times and reduce risks.
The modular design allows it to adapt to different scenarios with minimal changes. As we continue to improve accuracy and reduce false positives, this system can become a reliable layer of safety for both consumers and enterprises.